What is the Indian Premier League (IPL)?
It is the world's most popular, most watched, most sought-after cricket league.
What is Cricket?
Cricket is a team sport played between two sides of 11 players each, made up of batters and bowlers. It can be played in a stadium of any shape and size as long as the center has a 22-yard rectangular strip called the pitch. There are two innings (phases) in a match, and the team that wins the toss gets to pick what it wants to do first (i.e., bat first or bowl first). The bowler bowls to the batter; the batter scores runs by hitting the ball, whereas the bowler aims to take wickets (i.e., get the batter out). There are multiple formats of the game, but the IPL follows the T20 format, which stands for 20-20 and simply means that each team gets to bowl 20 overs. Each over consists of 6 balls, so every innings has a maximum of 120 legal balls. A team can win the match in two main ways:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as mlt
mlt.style.use('fivethirtyeight')
import seaborn as sns
# Used for interactive graphs
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
After a bit of googling, I found a data set on Kaggle that contains ball-by-ball data for IPL matches between 2008 and 2017: https://www.kaggle.com/manasgarg/ipl
This dataset consists of two CSV files (CSV stands for comma-separated values, a common format for sharing data): matches.csv and deliveries.csv.
To load them we use the pandas function read_csv(). This function simply takes the path to the file and converts it to a DataFrame (a 2D data structure used to store data, much like a SQL table).
matches = pd.read_csv('matches.csv')
deliveries = pd.read_csv('deliveries.csv')
matches.head()
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 | umpire3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017 | Hyderabad | 2017-04-05 | Sunrisers Hyderabad | Royal Challengers Bangalore | Royal Challengers Bangalore | field | normal | 0 | Sunrisers Hyderabad | 35 | 0 | Yuvraj Singh | Rajiv Gandhi International Stadium, Uppal | AY Dandekar | NJ Llong | NaN |
| 1 | 2 | 2017 | Pune | 2017-04-06 | Mumbai Indians | Rising Pune Supergiant | Rising Pune Supergiant | field | normal | 0 | Rising Pune Supergiant | 0 | 7 | SPD Smith | Maharashtra Cricket Association Stadium | A Nand Kishore | S Ravi | NaN |
| 2 | 3 | 2017 | Rajkot | 2017-04-07 | Gujarat Lions | Kolkata Knight Riders | Kolkata Knight Riders | field | normal | 0 | Kolkata Knight Riders | 0 | 10 | CA Lynn | Saurashtra Cricket Association Stadium | Nitin Menon | CK Nandan | NaN |
| 3 | 4 | 2017 | Indore | 2017-04-08 | Rising Pune Supergiant | Kings XI Punjab | Kings XI Punjab | field | normal | 0 | Kings XI Punjab | 0 | 6 | GJ Maxwell | Holkar Cricket Stadium | AK Chaudhary | C Shamshuddin | NaN |
| 4 | 5 | 2017 | Bangalore | 2017-04-08 | Royal Challengers Bangalore | Delhi Daredevils | Royal Challengers Bangalore | bat | normal | 0 | Royal Challengers Bangalore | 15 | 0 | KM Jadhav | M Chinnaswamy Stadium | NaN | NaN | NaN |
matches['umpire3'].unique()
array([nan])
deliveries.head()
| | match_id | inning | batting_team | bowling_team | over | ball | batsman | non_striker | bowler | is_super_over | ... | bye_runs | legbye_runs | noball_runs | penalty_runs | batsman_runs | extra_runs | total_runs | player_dismissed | dismissal_kind | fielder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 1 | DA Warner | S Dhawan | TS Mills | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 1 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 2 | DA Warner | S Dhawan | TS Mills | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 2 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 3 | DA Warner | S Dhawan | TS Mills | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 | NaN | NaN | NaN |
| 3 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 4 | DA Warner | S Dhawan | TS Mills | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| 4 | 1 | 1 | Sunrisers Hyderabad | Royal Challengers Bangalore | 1 | 5 | DA Warner | S Dhawan | TS Mills | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2 | NaN | NaN | NaN |
5 rows × 21 columns
matches['team1'].unique()
array(['Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',
'Rising Pune Supergiant', 'Royal Challengers Bangalore',
'Kolkata Knight Riders', 'Delhi Daredevils', 'Kings XI Punjab',
'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants'],
dtype=object)
When we look at the data, we realize that in the matches table (DataFrame), the umpire3 column is empty, filled with NaNs. So we can go ahead and drop that column from our table.
When we look at the deliveries table, we see that a few entries in some columns are NaNs.
We must not drop these columns (because they indicate something important when they actually contain a value), but we must make them usable. To do this, I will replace all NaN values in the deliveries table with 0s.
This is reasonable here because we are not actually missing data; instead, NaNs are used to indicate that the specific event did not occur. For this data, this is the best way forward, but it will not always be the best choice: it depends on the type of data and on the analysis we want to perform.
For further simplicity, I am going to abbreviate team names so that they are easier to type, display and use. We got all the unique teams in the data by running matches['team1'].unique(). Note that both spellings, 'Rising Pune Supergiant' and 'Rising Pune Supergiants', are mapped to the same abbreviation, RPS.
# Dropping the umpire3 Column
matches.drop(['umpire3'],axis=1,inplace=True)
#Filling values in the delivery table with 0's
deliveries.fillna(0,inplace=True)
#Replacing the Team Names with their abbreviations
matches.replace(['Mumbai Indians','Kolkata Knight Riders','Royal Challengers Bangalore','Deccan Chargers','Chennai Super Kings',
'Rajasthan Royals','Delhi Daredevils','Gujarat Lions','Kings XI Punjab',
'Sunrisers Hyderabad','Rising Pune Supergiants','Kochi Tuskers Kerala','Pune Warriors','Rising Pune Supergiant']
,['MI','KKR','RCB','DC','CSK','RR','DD','GL','KXIP','SRH','RPS','KTK','PW','RPS'],inplace=True)
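The three cleaning steps above can be tried out on a toy table first. The following is only a minimal sketch: the `toy` frame, its values and the `abbreviations` dictionary are made up for illustration and are not part of the IPL dataset. It also uses `dropna(axis=1, how="all")` and a dictionary passed to `replace` as alternatives to the calls used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real tables (values are invented for illustration)
toy = pd.DataFrame({
    "batsman_runs": [4, 0, 1],
    "player_dismissed": [np.nan, "DA Warner", np.nan],  # NaN = no dismissal on that ball
    "umpire3": [np.nan, np.nan, np.nan],                # column with no data at all
})

# Drop every column in which all values are missing
toy = toy.dropna(axis=1, how="all")

# Replace the remaining NaN markers with 0, meaning "the event did not occur"
toy = toy.fillna(0)

# Map long names to abbreviations with a dictionary instead of two parallel lists
abbreviations = {"Mumbai Indians": "MI", "Chennai Super Kings": "CSK"}
teams = pd.Series(["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians"])
teams = teams.replace(abbreviations)

print(toy)
print(teams.tolist())
```

A dictionary keeps each name next to its abbreviation, which makes it harder to shift the two lists against each other by accident.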
print('Total Matches Played:',matches.shape[0])
print('Total venues played at:',matches['city'].nunique())
print(' \n Venues Played At:',matches['city'].unique())
print(' \n Teams :',matches['team1'].unique())
print('MVP:',(matches['player_of_match'].value_counts()).idxmax())
print('Team with most wins:',((matches['winner']).value_counts()).idxmax())
Total Matches Played: 636
Total venues played at: 30

 Venues Played At: ['Hyderabad' 'Pune' 'Rajkot' 'Indore' 'Bangalore' 'Mumbai' 'Kolkata' 'Delhi' 'Chandigarh' 'Kanpur' 'Jaipur' 'Chennai' 'Cape Town' 'Port Elizabeth' 'Durban' 'Centurion' 'East London' 'Johannesburg' 'Kimberley' 'Bloemfontein' 'Ahmedabad' 'Cuttack' 'Nagpur' 'Dharamsala' 'Kochi' 'Visakhapatnam' 'Raipur' 'Ranchi' 'Abu Dhabi' 'Sharjah' nan]

 Teams : ['SRH' 'MI' 'GL' 'RPS' 'RCB' 'KKR' 'DD' 'KXIP' 'CSK' 'RR' 'DC' 'KTK' 'PW']
MVP: CH Gayle
Team with most wins: MI
We've used a few pandas library functions to perform some basic analysis.
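To make those pandas calls concrete, here is a small sketch on invented data (the `awards` and `cities` Series below are hypothetical, not taken from the dataset). It shows what `nunique()`, `unique()`, `value_counts()` and `idxmax()` each return.

```python
import numpy as np
import pandas as pd

# Invented award list, for illustration only
awards = pd.Series(["Gayle", "Dhoni", "Gayle", "Kohli", "Gayle", "Dhoni"])

counts = awards.value_counts()   # occurrences per value, sorted in descending order
top_player = counts.idxmax()     # index label that holds the largest count

# nunique() ignores missing values by default, while unique() keeps NaN in its result
cities = pd.Series(["Mumbai", "Pune", np.nan, "Mumbai"])
n_cities = cities.nunique()
all_values = cities.unique()

print(counts)
print(top_player, n_cities, all_values)
```

This difference between `nunique()` and `unique()` is why the output above reports 30 locations even though the printed array also contains a `nan` entry.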
As we can see, over the years the tournament has been played in 30 locations (cities, to be more precise, since we counted the city column). Most of these locations are in India, but if we look closely, some of them are in the UAE and South Africa!
The MVP, with the most Man of the Match awards, is none other than Chris Gayle! (If you follow cricket closely, you'll know what an impact this player has on his team and on the game.)
Let's look at every team's success so far:
# Let's calculate and plot the number of wins by each team across seasons!
# value_counts() gives the wins per team; reset_index() turns that Series
# into a two-column DataFrame (team, wins) that seaborn can plot directly
team_wins = matches['winner'].value_counts()
team_wins_df = team_wins.rename_axis('team').reset_index(name='wins')
mlt.title("Total Victories of IPL Teams")
sns.barplot(x='wins', y='team', data=team_wins_df, palette='Paired');
As we can see, MI (Mumbai Indians) is the most successful IPL team; in fact, they've won the tournament 5 times now!
As we can see, visualization is an important aspect of data analysis: it lets us put things in perspective. Here, we used the matplotlib and seaborn libraries to plot the above graph. We created a bar plot using the seaborn barplot function. We always have to pass data to a visualization function, and passing the right data is very important. So, above, we created a new DataFrame called team_wins_df that holds exactly the data we need for the graph.
In cricket, the toss is a very important factor. It gives the team that wins it an edge, because they get their preferred decision (to either bowl first or bat first). This decision is made by the captain and coach after taking into consideration factors like ground size, dew, the opposition, the time of day, etc.
Let's take a closer look at the toss, toss decisions and the way they affect the game.
print('Toss Decisions in %:\n',((matches['toss_decision']).value_counts())/matches.shape[0]*100)
Toss Decisions in %:
 field    57.075472
bat      42.924528
Name: toss_decision, dtype: float64
We use matches.shape[0] to get the total number of matches, and we use matches['toss_decision'].value_counts() to count the occurrences of each decision.
As we can see, 57% of the teams winning the toss decided to field (bowl) first and the rest chose to bat first.
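The same percentages can also be obtained without dividing by the row count by hand. The sketch below uses an invented `decisions` Series (not the real column) to compare the manual calculation with `value_counts(normalize=True)`.

```python
import pandas as pd

# Invented toss decisions, for illustration only
decisions = pd.Series(["field", "bat", "field", "field", "bat"])

# Manual version, as used in this notebook: counts divided by the number of rows
manual = decisions.value_counts() / decisions.shape[0] * 100

# Built-in version: normalize=True returns relative frequencies directly
direct = decisions.value_counts(normalize=True) * 100

print(manual)
print(direct)
```

The two agree when the column has no missing values. If it did, `normalize=True` would by default leave NaN rows out of the denominator, whereas dividing by `shape[0]` counts every row.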
# using a countplot to show toss decisions.
mlt.subplots(figsize=(10,6))
sns.countplot(x='season',hue='toss_decision',data=matches)
mlt.show()
2016 was the year with the highest divide between teams choosing to field first and those choosing to bat first, while 2012 was the season with the lowest divide.
Unsurprisingly, there is a lot of variation across years in what teams decide at the toss. This could be because of the locations the teams were playing at: some conditions call for fielding first while others call for batting first.
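The "divide" read off the count plot can also be computed as a number. This is a sketch under assumptions: the `toy` frame and its season labels 1 and 2 are invented, and "divide" is taken to mean the absolute difference between field-first and bat-first counts in a season.

```python
import pandas as pd

# Invented data: season 1 is evenly split, season 2 leans towards fielding first
toy = pd.DataFrame({
    "season": [1, 1, 1, 1, 2, 2, 2, 2],
    "toss_decision": ["bat", "field", "bat", "field",
                      "field", "field", "field", "bat"],
})

# One row per season, one column per decision, cells hold the counts
per_season = pd.crosstab(toy["season"], toy["toss_decision"])

# Absolute gap between the two decisions in each season
per_season["gap"] = (per_season["field"] - per_season["bat"]).abs()

widest = per_season["gap"].idxmax()     # season with the largest divide
narrowest = per_season["gap"].idxmin()  # season with the smallest divide
print(per_season)
```

Applied to the real `matches` table, the same `crosstab` call would give the exact counts behind each pair of bars in the plot.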
Let's see where the 2012 and 2017 seasons were played.
matches[matches.season == 2012].head(3)
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 307 | 308 | 2012 | Chennai | 2012-04-04 | CSK | MI | MI | field | normal | 0 | MI | 0 | 8 | RE Levi | MA Chidambaram Stadium, Chepauk | JD Cloete | SJA Taufel |
| 308 | 309 | 2012 | Kolkata | 2012-04-05 | KKR | DD | DD | field | normal | 0 | DD | 0 | 8 | IK Pathan | Eden Gardens | S Asnani | HDPK Dharmasena |
| 309 | 310 | 2012 | Mumbai | 2012-04-06 | PW | MI | MI | field | normal | 0 | PW | 28 | 0 | SPD Smith | Wankhede Stadium | AK Chaudhary | SJA Taufel |
matches[matches.season == 2017].head(3)
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017 | Hyderabad | 2017-04-05 | SRH | RCB | RCB | field | normal | 0 | SRH | 35 | 0 | Yuvraj Singh | Rajiv Gandhi International Stadium, Uppal | AY Dandekar | NJ Llong |
| 1 | 2 | 2017 | Pune | 2017-04-06 | MI | RPS | RPS | field | normal | 0 | RPS | 0 | 7 | SPD Smith | Maharashtra Cricket Association Stadium | A Nand Kishore | S Ravi |
| 2 | 3 | 2017 | Rajkot | 2017-04-07 | GL | KKR | KKR | field | normal | 0 | KKR | 0 | 10 | CA Lynn | Saurashtra Cricket Association Stadium | Nitin Menon | CK Nandan |
Turns out, both the 2012 season and the 2017 season were played in India! So it might not be the location after all. It could just be a shift in strategy by coaches and teams.
mlt.subplots(figsize=(10,6))
ax=matches['toss_winner'].value_counts().plot.bar(width=0.9,color=sns.color_palette('RdYlGn',20))
for p in ax.patches:
    # label each bar with its height (the number of tosses won)
    ax.annotate(format(p.get_height()), (p.get_x()+0.15, p.get_height()+1))
mlt.show()
As we can see, MI has won the most tosses! This bar plot is also very similar to the bar plot above titled "Total Victories of IPL Teams": most of the successful teams have also won a lot of tosses, so the toss can be an important factor in cricket. Please note, however, that this chart in no way indicates that MI and the other teams with many toss wins have a higher chance of winning any given toss. The teams at the lower end (GL, RPS, KTK) do not have a worse chance of winning the toss; they simply have not played as many games. This is a very important observation.
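One way to check the observation about games played is to divide toss wins by matches played. The sketch below uses an invented five-match table (the team abbreviations are reused only as labels; the results are made up) to show how the ranking by raw count and the ranking by rate can differ.

```python
import pandas as pd

# Invented fixtures: MI plays four matches, GL plays only one
toy = pd.DataFrame({
    "team1":       ["MI",  "MI",  "MI",  "MI",  "GL"],
    "team2":       ["CSK", "CSK", "KKR", "KKR", "CSK"],
    "toss_winner": ["MI",  "CSK", "MI",  "KKR", "GL"],
})

# Matches played per team: every appearance as team1 or team2 counts once
played = pd.concat([toy["team1"], toy["team2"]]).value_counts()

# Raw number of tosses won per team
toss_wins = toy["toss_winner"].value_counts()

# Toss wins per match played; teams that never won a toss would get 0
toss_rate = (toss_wins / played).fillna(0)

print(toss_wins.idxmax())  # team with the most toss wins by raw count
print(toss_rate.idxmax())  # team with the highest toss-win rate
```

In this toy table MI leads the raw count simply because it played the most matches, while GL has the highest rate from a single game, which is the same effect described above.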
# creating a pie chart using matplotlib.pie
df=matches[matches['toss_winner']==matches['winner']]
slices=[len(df),(len(matches)-len(df))]  # len(matches) is the total number of matches (636 here)
labels=['Toss winners winning','Toss winners losing']
mlt.pie(slices,labels=labels,startangle=90,shadow=True,explode=(0,0.05),autopct='%1.1f%%',colors=['g','r'])
fig = mlt.gcf()
fig.set_size_inches(6,6)
mlt.show()
Important fact: The probability of winning the toss is equal for both the teams as a 2 sided coin is flipped at the toss.
From the above pie chart, we see that winning the toss does not necessarily mean winning the match!
At this point, it is also nice to appreciate the fact that we have all these built-in library functions that make our lives easy. All we have to do is filter the data into a format that suits the visualization we choose.
Above, we used the matplotlib pie method to create the pie chart.
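The two pie slices can also be computed directly from a boolean comparison, without building a filtered DataFrame. The following sketch uses an invented four-match table, so the numbers are illustrative only.

```python
import pandas as pd

# Invented results, for illustration only
toy = pd.DataFrame({
    "toss_winner": ["MI", "CSK", "MI",  "KKR"],
    "winner":      ["MI", "MI",  "CSK", "KKR"],
})

# True where the team that won the toss also won the match
toss_winner_won = toy["toss_winner"] == toy["winner"]

# The mean of a boolean Series is the share of True values
share_won = toss_winner_won.mean()

# Counts for the two pie slices: toss winner won / toss winner did not win
slices = [int(toss_winner_won.sum()), int((~toss_winner_won).sum())]

print(share_won, slices)
```

One caveat of this comparison: a match with a missing winner compares as False, so it would land in the "toss winners losing" slice.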
Let's see how the toss and other factors impact the tournament decider!
# Creating a finals_df that contains only the last/final match of every season
finals_df=matches.drop_duplicates(subset=['season'],keep='last')
finals_df=finals_df[['id','season','city','team1','team2','toss_winner','toss_decision','winner']]
# creating a temp df to create a pie chart.
df=finals_df[finals_df['toss_winner']==finals_df['winner']]
# slices: finals won by the toss winner vs. finals lost by the toss winner
slices=[len(df),(len(finals_df)-len(df))]
labels=['Toss winners winning','Toss winners losing']
mlt.pie(slices,labels=labels,startangle=90,shadow=True,colors=['g','r'],explode=(0,0.1),autopct='%1.1f%%')
fig = mlt.gcf()
fig.set_size_inches(5,5)
mlt.show()
70% of the teams that won the toss in the decider went on to win the match (7 of the 10 finals)!
This could just be because the team winning the toss is under less pressure, but there could definitely be other factors that are not visible in this data. The toss appears to be an important factor when it comes to the final.
Let's see how the decision made after winning the toss affects the outcome of the game:
# small and simple countplot to see the how the decision to field/bat affects
# the outcome of the tournament decider/final
finals_df['Final_winner']=finals_df['toss_winner']==finals_df['winner']
sns.countplot(x='toss_decision',hue='Final_winner',data=finals_df)
mlt.show()
Above, True means the toss winner won the final and False means the toss winner lost it.
It looks like teams that won the toss and chose to bat first have won the final most often. Captains and coaches should definitely take this into consideration while making a decision in the final!
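The count plot can be backed up with a small table. In the sketch below, the `toss_decision` and `toss_winner_won` values are transcribed by hand from the finals_df table printed near the end of this notebook (the column name `toss_winner_won` is mine; it corresponds to Final_winner), so any transcription slip is my own.

```python
import pandas as pd

# One row per final, transcribed from the finals_df output in this notebook
finals = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017],
    "toss_decision": ["field", "field", "bat", "bat", "bat",
                      "bat", "field", "field", "bat", "bat"],
    "toss_winner_won": [True, False, True, True, False,
                        True, True, False, True, True],
})

# For each decision: finals won by the toss winner, finals played, and win share
summary = finals.groupby("toss_decision")["toss_winner_won"].agg(
    ["sum", "count", "mean"]
)
print(summary)
```

With only ten finals in total, these shares are best read as a description of what happened rather than as evidence that batting first causes wins.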
I will explain what the significance of each statistic is and how it affects the team.
Batters are responsible for the runs each team scores. The more runs the team batting first scores, the harder it is for the team batting second to win the match; a big total also gives the bowlers defending it more leeway to get the opposition's batters out. So, in simple terms, the more runs a team scores, the better its chance of winning the match. This is an indicator of how competitive teams are.
To do this, we create batters_df to store only the batting data we need. The data we need exists in both the matches table and the deliveries table. We use the match id to our advantage here (it is unique for every match, stored as id in matches and as match_id in deliveries) to merge the two tables with a left join, and then aggregate the result by season into season_df.
The generated graph shows that, in general, teams (all together) have increased the number of runs scored across seasons. Taken in isolation, this indicates that the tournament was at its most competitive in the 2013 season. There is a sharp dip from there on in terms of the runs being scored.
# Creating the batters_df
batters_df = matches[['id','season']].merge(deliveries, left_on = 'id', right_on = 'match_id', how = 'left').drop('id', axis = 1)
# merging the matches and deliveries dataframe by using the id and match_id columns respectively
season_df=batters_df.groupby(['season'])['total_runs'].sum().reset_index()
season_df.set_index('season').plot(marker='o')
mlt.gcf().set_size_inches(10,6)
mlt.title('Total Runs Across the Seasons')
mlt.show()
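The merge-then-aggregate pattern used for season_df is easier to follow on a tiny example. The two frames below are invented stand-ins for matches and deliveries; match 4 deliberately has no deliveries, to show what a left join does with it.

```python
import pandas as pd

# Invented stand-ins for the matches and deliveries tables
toy_matches = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "season": [2008, 2008, 2009, 2009],
})
toy_deliveries = pd.DataFrame({
    "match_id":   [1, 1, 2, 3, 3],
    "total_runs": [4, 1, 6, 0, 2],
})

# Left join: every match is kept, and each delivery row gains its match's season
merged = toy_matches.merge(
    toy_deliveries, left_on="id", right_on="match_id", how="left"
).drop("id", axis=1)

# Sum the runs of all deliveries within each season (NaN rows are skipped)
runs_per_season = merged.groupby("season")["total_runs"].sum()

print(merged)
print(runs_per_season)
```

Match 4 survives the join as a single row with NaN in the delivery columns. With an inner join it would disappear from the result instead.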
This is the same statistic as above, except calculated per match.
This gives a more granular look at how the tournament has progressed across seasons.
We do this in the same way as above, by creating a new DataFrame with only the data we need.
The graph genr
avgruns_each_season=matches.groupby(['season']).count().id.reset_index()
avgruns_each_season.rename(columns={'id':'matches'},inplace=True)
avgruns_each_season['total_runs']=season_df['total_runs']
avgruns_each_season['average_runs_per_match']=avgruns_each_season['total_runs']/avgruns_each_season['matches']
avgruns_each_season.set_index('season')['average_runs_per_match'].plot(marker='o')
mlt.gcf().set_size_inches(10,6)
mlt.title('Average Runs per match across Seasons')
mlt.show()
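An alternative way to get the same average is to total the runs of each match first and then average those totals within a season. This is a sketch on an invented ball-by-ball table, not the notebook's own method.

```python
import pandas as pd

# Invented ball-by-ball rows that already carry their season
toy = pd.DataFrame({
    "season":     [2008, 2008, 2008, 2009, 2009],
    "match_id":   [1, 1, 2, 3, 3],
    "total_runs": [4, 2, 6, 1, 3],
})

# Step 1: total runs scored in each match
per_match = toy.groupby(["season", "match_id"])["total_runs"].sum()

# Step 2: average of those match totals within each season
avg_per_match = per_match.groupby(level="season").mean()

print(avg_per_match)
```

The code in this notebook copies season_df['total_runs'] into another frame, which lines rows up by their integer index and so relies on both frames being sorted by season. Grouping by the season key, as here, does not depend on row order.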
matches_played_byteams=pd.concat([matches['team1'],matches['team2']])
matches_played_byteams=matches_played_byteams.value_counts().reset_index()
matches_played_byteams.columns=['Team','Total Matches']
# look up each team's win count by team name rather than by row position
matches_played_byteams['wins']=matches_played_byteams['Team'].map(matches['winner'].value_counts())
matches_played_byteams.set_index('Team',inplace=True)
trace1 = go.Bar(
x=matches_played_byteams.index,
y=matches_played_byteams['Total Matches'],
name='Total Matches'
)
trace2 = go.Bar(
x=matches_played_byteams.index,
y=matches_played_byteams['wins'],
name='Matches Won'
)
data = [trace1, trace2]
layout = go.Layout(
barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='stacked-bar')
finals_df
| | id | season | city | team1 | team2 | toss_winner | toss_decision | winner | Final_winner |
|---|---|---|---|---|---|---|---|---|---|
| 58 | 59 | 2017 | Hyderabad | MI | RPS | MI | bat | MI | True |
| 116 | 117 | 2008 | Mumbai | CSK | RR | RR | field | RR | True |
| 173 | 174 | 2009 | Johannesburg | DC | RCB | RCB | field | DC | False |
| 233 | 234 | 2010 | Mumbai | CSK | MI | CSK | bat | CSK | True |
| 306 | 307 | 2011 | Chennai | CSK | RCB | CSK | bat | CSK | True |
| 380 | 381 | 2012 | Chennai | CSK | KKR | CSK | bat | KKR | False |
| 456 | 457 | 2013 | Kolkata | MI | CSK | MI | bat | MI | True |
| 516 | 517 | 2014 | Bangalore | KXIP | KKR | KKR | field | KKR | True |
| 575 | 576 | 2015 | Kolkata | MI | CSK | CSK | field | MI | False |
| 635 | 636 | 2016 | Bangalore | SRH | RCB | SRH | bat | SRH | True |
matches['toss_winner']
0 RCB
1 RPS
2 KKR
3 KXIP
4 RCB
...
631 RCB
632 RCB
633 KKR
634 SRH
635 SRH
Name: toss_winner, Length: 636, dtype: object
temp = df=matches[matches['toss_winner']==matches['winner']]
temp
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2017 | Pune | 2017-04-06 | MI | RPS | RPS | field | normal | 0 | RPS | 0 | 7 | SPD Smith | Maharashtra Cricket Association Stadium | A Nand Kishore | S Ravi |
| 2 | 3 | 2017 | Rajkot | 2017-04-07 | GL | KKR | KKR | field | normal | 0 | KKR | 0 | 10 | CA Lynn | Saurashtra Cricket Association Stadium | Nitin Menon | CK Nandan |
| 3 | 4 | 2017 | Indore | 2017-04-08 | RPS | KXIP | KXIP | field | normal | 0 | KXIP | 0 | 6 | GJ Maxwell | Holkar Cricket Stadium | AK Chaudhary | C Shamshuddin |
| 4 | 5 | 2017 | Bangalore | 2017-04-08 | RCB | DD | RCB | bat | normal | 0 | RCB | 15 | 0 | KM Jadhav | M Chinnaswamy Stadium | NaN | NaN |
| 5 | 6 | 2017 | Hyderabad | 2017-04-09 | GL | SRH | SRH | field | normal | 0 | SRH | 0 | 9 | Rashid Khan | Rajiv Gandhi International Stadium, Uppal | A Deshmukh | NJ Llong |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 629 | 630 | 2016 | Kanpur | 2016-05-21 | MI | GL | GL | field | normal | 0 | GL | 0 | 6 | SK Raina | Green Park | AK Chaudhary | CK Nandan |
| 631 | 632 | 2016 | Raipur | 2016-05-22 | DD | RCB | RCB | field | normal | 0 | RCB | 0 | 6 | V Kohli | Shaheed Veer Narayan Singh International Stadium | A Nand Kishore | BNJ Oxenford |
| 632 | 633 | 2016 | Bangalore | 2016-05-24 | GL | RCB | RCB | field | normal | 0 | RCB | 0 | 4 | AB de Villiers | M Chinnaswamy Stadium | AK Chaudhary | HDPK Dharmasena |
| 634 | 635 | 2016 | Delhi | 2016-05-27 | GL | SRH | SRH | field | normal | 0 | SRH | 0 | 4 | DA Warner | Feroz Shah Kotla | M Erasmus | CK Nandan |
| 635 | 636 | 2016 | Bangalore | 2016-05-29 | SRH | RCB | SRH | bat | normal | 0 | SRH | 8 | 0 | BCJ Cutting | M Chinnaswamy Stadium | HDPK Dharmasena | BNJ Oxenford |
325 rows × 17 columns